Using Statistical Translation Models for Bilingual IR
Abstract
This report describes our experiments with statistical translation models for the bilingual IR tasks in CLEF-2001. The translation models were trained on a set of parallel web pages automatically mined from the Web. Our goal is to compare the following approaches: training the translation models on the original parallel corpora or on cleaned corpora; weighting query words by the raw translation probabilities or by those probabilities combined with IDF; and applying different cut-off probability values to the translation models (i.e., deleting translations whose probability falls below a threshold). Our results show that the models trained on the original parallel corpora work better than those trained on the cleaned corpora; that combining the probabilities with IDF is beneficial; and that cutting off the translation models at a certain value (0.01 in our case) is better than not cutting them off.
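The query-weighting options compared above can be illustrated with a minimal sketch. It is not the implementation used in the report: the function names, the toy translation table, the smoothed IDF formula, and the choice to combine probability and IDF by multiplication are all assumptions made for illustration, since the abstract does not give these details.

```python
import math


def idf(term, doc_freq, num_docs):
    """Smoothed inverse document frequency (the exact IDF variant is an assumption)."""
    return math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))


def translate_query(source_words, translation_table, doc_freq, num_docs,
                    cutoff=0.01, use_idf=True):
    """Turn source-language query words into weighted target-language terms.

    translation_table maps a source word to {target word: translation probability}.
    Translations with probability below `cutoff` are dropped. Combining the
    probability with IDF by multiplication is an assumption of this sketch.
    """
    weights = {}
    for word in source_words:
        for target, prob in translation_table.get(word, {}).items():
            if prob < cutoff:
                continue  # delete translations below the threshold
            weight = prob * idf(target, doc_freq, num_docs) if use_idf else prob
            weights[target] = weights.get(target, 0.0) + weight
    return weights


# Hypothetical toy data, not taken from the report.
toy_table = {"maison": {"house": 0.6, "home": 0.3, "the": 0.005}}
toy_doc_freq = {"house": 9, "home": 99, "the": 990}

print(translate_query(["maison"], toy_table, toy_doc_freq, num_docs=999))
print(translate_query(["maison"], toy_table, toy_doc_freq, num_docs=999, use_idf=False))
```

With the cut-off at 0.01, the low-probability translation "the" is removed before weighting, and the IDF factor further favors rarer target terms over frequent ones.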
Related resources
Bilingual phrases for statistical machine translation
The statistical framework has proved to be very successful in machine translation. The main reason for this success is the existence of powerful techniques that allow machine translation systems to be built automatically from available parallel corpora. Most statistical machine translation approaches are based on single-word translation models, which do not take bilingual contextual information...
Statistical Approach With Factored Translation Models For Indian Languages
Factored translation models are an extension to phrase based statistical translation models which integrate additional annotation at word level. Here we present a study of statistical models and approaches to translate Hindi to English. Experiments were also conducted on alignment models using various word groupings and using GIZA++ to predict their English translations and fertility. TAJ A new...
Translingual Information Retrieval: Learning from Bilingual Corpora
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR appr...
Bilingual Word Spectral Clustering for Statistical Machine Translation
In this paper, a variant of a spectral clustering algorithm is proposed for bilingual word clustering. The proposed algorithm generates the two sets of clusters for both languages efficiently with high semantic correlation within monolingual clusters, and high translation quality across the clusters between two languages. Each cluster level translation is considered as a bilingual concept, whic...
A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with the hyphen separating the parts. Persian has multi-part words as well. In Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of the space leads to some s...
Publication date: 2001